Segmenting Big Data Time Series Stream Data
Authors
Abstract
Big data time series data streams are ubiquitous in finance, meteorology and engineering. Because of its tremendous volume, it may be impossible to process an entire “big data” continuous data stream or to scan through it multiple times. As Heraclitus’s well-known saying goes, “you never step in the same stream twice,” and so it is with “big data” temporal data streams. Unlike traditional data sets, big data continuous data streams flow into a computer system continuously, in a non-stationary way and with varying update rates. They are time-stamped, fast-changing, massive, and potentially infinite. Under these circumstances, they represent an application area of growing importance in data mining research. For example, sensors can generate one million samples every minute (Hulten & Domingos, 2003); the primary purpose of time series data stream segmentation is therefore dimensionality reduction. This technique is used in many areas of data stream mining, such as finding frequent patterns, detecting structural changes and concept drifts (Ge & Smyth, 1999), time series classification and prediction (Hulten & Domingos, 2003), and time series similarity search (Keogh, Chakrabarti, Pazzani, & Mehrotra, 2000; Park, Kim, & Chu, 2000). The main principle of segmentation algorithms consists in reducing the dimensionality of the big data time series by dividing the time axis into intervals, each of which behaves approximately according to a simple model. A good big data time series data stream segmentation algorithm must be OFASC (Online, Fast, Accurate, Simple and Comparable). For example, the Sliding Window algorithm (Keogh, Chu, Hart, & Pazzani, 2004) is online (O), very fast (F) and relatively simple (S) to use in online segmentation applications, but it sometimes gives poor accuracy (A) and does not support online multivariate segmentation (C). We therefore classify it as an OFS segmentation algorithm.
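To make the sliding-window idea concrete, here is a minimal sketch of an online piecewise linear segmenter. It is not the authors' implementation: the function names, the choice of linear interpolation between the first and last sample of the window as the "simple model", and the use of the maximum vertical deviation as the error measure are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def max_interpolation_error(window: List[float]) -> float:
    """Largest vertical distance between the samples in `window` and the
    straight line joining its first and last sample (assumed error measure)."""
    if len(window) < 3:
        return 0.0  # one or two points are always fitted exactly
    first, last = window[0], window[-1]
    slope = (last - first) / (len(window) - 1)
    return max(abs(y - (first + slope * i)) for i, y in enumerate(window))


def sliding_window_segments(
    stream: Iterable[float], max_error: float
) -> Iterator[List[float]]:
    """Yield segments one at a time while reading the stream exactly once."""
    window: List[float] = []
    for sample in stream:
        window.append(sample)
        if max_interpolation_error(window) > max_error:
            # The newest sample broke the threshold: close the segment
            # without it, then start the next segment at the shared
            # boundary point followed by the newest sample.
            yield window[:-1]
            window = window[-2:]
    if window:
        yield window  # flush whatever remains when the stream ends


segments = list(sliding_window_segments([0, 1, 2, 3, 2, 1, 0], max_error=0.1))
```

On this toy input the rising and the falling ramp come out as two segments that share the peak sample. The sketch handles a single variable and takes an absolute `max_error`, which is exactly the kind of nominal, stream-specific threshold that makes results hard to compare across different streams.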
The segmentation problem can be defined in the following way. First, given a time series data stream, produce the best representation such that the maximum error for any segment does not exceed a user-specified confidence-level error threshold; using a relative parameter such as a confidence level makes it possible to evaluate online multivariate segmentation. Second, construct a user-friendly segmentation application that evaluates and compares the proposed online segmentation algorithms in real time. As we shall see in later sections, the state-of-the-art segmentation algorithms do not meet all these requirements. The rest of the paper is organized as follows. In Section 2, we provide a literature review of three state-of-the-art online piecewise linear segmentation algorithms. In Section 3, we provide a methodology for improving the existing state-of-the-art online segmentation algorithms. The proposed methodology is based on Hoeffding bound error estimation, which uses a relative probability parameter instead of a nominal maximum-error parameter and meets the proposed OFASC re...
Dima Alberg, SCE Shamoon College of Engineering, Israel
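As a rough illustration of how a confidence level can stand in for a nominal error threshold, the sketch below evaluates the standard Hoeffding bound: after n observations of a variable with range R, the sample mean lies within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean with probability at least 1 − δ. How the paper plugs this bound into its segmentation procedure is not stated in the abstract, so the function name and its parameters are hypothetical.

```python
import math


def hoeffding_bound(value_range: float, confidence: float, n: int) -> float:
    """Return epsilon such that, with probability >= `confidence`, the mean
    of `n` observations is within epsilon of the true mean.

    value_range: width R of the interval the observations can fall in.
    confidence:  the relative parameter 1 - delta, strictly between 0 and 1.
    n:           number of observations seen so far.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    delta = 1.0 - confidence
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


# The same 95% confidence level yields thresholds scaled to each stream's range.
eps_unit_range = hoeffding_bound(value_range=1.0, confidence=0.95, n=100)
eps_wide_range = hoeffding_bound(value_range=50.0, confidence=0.95, n=100)
```

Because the bound scales with the range of the data, a single confidence level translates into a different absolute tolerance for each stream, which is one plausible reading of why a relative parameter helps with comparability.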
Related resources
PRESEE: An MDL/MML Algorithm to Time-Series Stream Segmenting
Time-series stream is one of the most common data types in data mining field. It is prevalent in fields such as stock market, ecology, and medical care. Segmentation is a key step to accelerate the processing speed of time-series stream mining. Previous algorithms for segmenting mainly focused on the issue of ameliorating precision instead of paying much attention to the efficiency. Moreover, t...
Algorithms for Segmenting Time Series
As with most computer science problems, representation of the data is the key to efficient and effective solutions. Piecewise linear representation has been used for the representation of the data. This representation has been used by various researchers to support clustering, classification, indexing and association rule mining of time series data. A variety of algorithms have been proposed to obtain...
Online Hoeffding Bound Algorithm for Segmenting Time Series Stream Data
In this paper we introduce the ISW (Interval Sliding Window) algorithm, which is applicable to numerical time series data streams and uses as input the combined Hoeffding bound confidence level parameter rather than the maximum error threshold. The proposed algorithm has two advantages: first, it allows performance comparisons across different time series data streams without changing the algor...
Chapter 1. Key Technologies for Big Data Stream Computing
1.1 Introduction Big data computing is a new trend for future computing with the quantity of data growing and the speed of data increasing. In general, there are two main mechanisms for big data computing, i.e., big data stream computing and big data batch computing. Big data stream computing is a model of straight through computing, such as Storm [1] and S4 [2] which do for stream computing wh...
Key Technologies for Big Data Stream Computing
As a new trend for data-intensive computing, real-time stream computing is gaining significant attention in the Big Data era. In theory, stream computing is an effective way to support Big Data by providing extremely low-latency processing tools and massively parallel processing architectures in real-time data analysis. However, in most existing stream computing environments, how to efficiently...